Abstract. We address the problem of checking state reachability for
programs running under Total Store Order (TSO). The problem has been 
shown to be decidable but the cost is prohibitive, namely non-primitive 
recursive. We propose here to give up completeness. Our contribution is 
a new algorithm for TSO reachability: it uses the standard SC semantics 
and introduces the TSO semantics lazily and only where needed. At 
the heart of our algorithm is an iterative refinement of the program of 
interest. If the program’s goal state is SC-reachable, we are done. If the 
goal state is not SC-reachable, this may be due to the fact that SC under-approximates
TSO. We employ a second algorithm that determines TSO
computations which are infeasible under SC, and hence likely to lead to 
new states. We enrich the program to emulate, under SC, these TSO 
computations. Altogether, this yields an iterative under-approximation 
that we prove sound and complete for bug hunting, i.e., a semi-decision 
procedure halting for positive cases of reachability. We have implemented 
the procedure as an extension to the tool Trencher [1] and compared it
to the Memorax [2] and CBMC [?] model checkers.


1 Introduction 

Sequential consistency (SC) [21] is the semantics typically assumed for parallel
programs. Under SC, instructions are executed atomically and in program order. 
When programs are executed on an Intel x86 processor, however, they are only 
guaranteed a weaker semantics known as Total Store Order (TSO). TSO weakens 
the synchronization guarantees given by SC, which in turn may lead to erroneous 
behavior. TSO reflects the architectural optimization of store buffers. To reduce 
the latency of memory accesses, store commands are added to a thread-local 
FIFO buffer and only later executed on memory. 

To check for correct behavior, reachability techniques have proven useful. 
Given a program and a goal state, the task is to check whether the state is 
reachable. To give an example, assertion failures can be phrased as reachability 
problems. Reachability depends on the underlying semantics. Under SC, the 
problem is known to be PSPACE-complete [?]. Under TSO, it is considerably
more difficult: although decidable, it is non-primitive recursive-hard [5].

Due to the high complexity, tools rarely provide decision procedures [?].
Instead, most approaches implement approximations. Typical approximations
of TSO reachability bound the number of loop iterations [?], the number of
context switches between threads [5], or the size of store buffers [?]. What
all these approaches have in common is that they introduce store buffering in
the whole program. We claim that such a comprehensive instrumentation is
unnecessarily heavy.

The idea of our method is to introduce store buffering lazily and only where 
needed. Unlike [5], we do not target completeness. Instead, we argue that our lazy 
TSO reachability checker is useful for a fast detection of bugs that are due to the 
TSO semantics. At a high level, we solve the expensive TSO reachability problem 
with a series of cheap SC reachability checks — very much like SAT solvers are 
invoked as subroutines of costlier analyses. The SC checks run interleaved with 
queries to an oracle. The task of the oracle is to suggest sequences of instructions 
that should be considered under TSO, which means they are likely to lead to 
TSO-reachable states outside SC. 

To be more precise, the algorithm iteratively repeats the following steps. 
First, it checks whether the goal state is SC-reachable. If this is the case, the 
state will be TSO-reachable as well and the algorithm returns. If the state is not 
SC-reachable, the algorithm asks the oracle for a sequence of instructions and 
encodes the TSO behavior of the sequence into the input program. As a result, 
precisely this TSO behavior becomes available under SC. The encoding is linear 
in the size of the input program and in the length of the sequence. 

The algorithm is a semi-decision procedure: it always returns correct answers 
and is guaranteed to terminate if the goal state is TSO-reachable. This guarantee 
relies on one assumption on the oracle. If the oracle returns the empty sequence, 
then the SC- and the TSO-reachable states of the input program have to coincide. 
We also come up with a good oracle: robustness checkers naturally meet the 
above requirement. Intuitively, a program is robust against TSO if its
partial-order behaviors (reflecting data and control dependencies) under TSO and under
SC coincide. Checking robustness is much easier than TSO reachability, namely
PSPACE-complete [?], and hence well-suited for iterative invocations.

We have implemented lazy TSO reachability as an extension to our tool
Trencher [1], reusing the robustness checking algorithms of Trencher to
derive an oracle. The implementation is able to solve positive instances of TSO
reachability as well as correctly determine safety for robust programs. The source
code and experiments are available online [1].

The structure of the paper is as follows. We introduce parallel programs with
their TSO and their SC semantics in Section 2. Section 3 presents our main
contribution, the lazy approach to solving TSO reachability. Section 4 describes
the robustness-based oracle. The experimental evaluation is given in Section 5.
Details and proofs missing in the main text can be found in the appendix.


Related Work 

As already mentioned, TSO reachability was proven decidable but non-primitive
recursive [5] in the case of a finite number of threads and a finite data domain. In
the same setting, robustness was shown to be PSPACE-complete [?]. Checking
and enforcing robustness against weak memory models has been addressed in
[?]. The first work to give an efficient sound and complete decision
procedure for checking robustness is [?].

The works [?] propose state-based techniques to solve TSO reachability.
An under-approximative method that uses bounded context switching is
given in [?]. It encodes store buffers into a linear-size instrumentation, and the
instrumented program is checked for SC reachability. The under-approximative
techniques of [?] are able to guarantee safety only for programs with bounded
loops. On the other side of the spectrum, over-approximative analyses abstract
store buffers into sets combined with bounded queues [?].


2 Parallel Programs 


We use automata to define the syntax and the semantics of parallel programs. A
(non-deterministic) automaton over an alphabet $A$ is a tuple $\mathcal{A} = (A, S, \to, s_0)$,
where $S$ is a set of states, ${\to} \subseteq S \times (A \cup \{\varepsilon\}) \times S$ is a set of transitions, and
$s_0 \in S$ is an initial state. The automaton is finite if the transition relation $\to$
is finite. We write $s \xrightarrow{a} s'$ if $(s, a, s') \in {\to}$ and extend the transition relation to
sequences $w \in A^*$ as expected. The language of $\mathcal{A}$ with final states $F \subseteq S$ is
$\mathcal{L}_F(\mathcal{A}) := \{w \in A^* \mid s_0 \xrightarrow{w} s \in F\}$. We say that state $s \in S$ is reachable if
$s_0 \xrightarrow{w} s$ for some sequence $w \in A^*$. Letter $a$ precedes $b$ in $w$, denoted by $a <_w b$,
if $w = w_1 \cdot a \cdot w_2 \cdot b \cdot w_3$ for some $w_1, w_2, w_3 \in A^*$.
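As a small illustration of the reachability notion just defined, the following Python sketch computes the reachable states of a finite automaton by breadth-first search. The representation of transitions as triples and the example automaton `EXAMPLE` are assumptions made for this sketch only.

```python
from collections import deque

def reachable_states(transitions, initial):
    """Return the set of states reachable from `initial`.

    `transitions` is an iterable of (source, letter, target) triples.
    The letter is ignored here; use None for an epsilon move.
    """
    successors = {}
    for source, _letter, target in transitions:
        successors.setdefault(source, []).append(target)

    seen = {initial}
    queue = deque([initial])
    while queue:  # breadth-first search over the transition relation
        state = queue.popleft()
        for nxt in successors.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical automaton: "s3" is not reachable from "s0".
EXAMPLE = [("s0", "a", "s1"), ("s1", None, "s2"), ("s3", "b", "s0")]
```

A state is reachable exactly if it appears in the returned set, so a goal check reduces to a set intersection.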



A parallel program $P$ is a finite sequence of threads that are identified by
indices $t$ from $\mathrm{TID}$. Each thread $t := (\mathrm{Com}_t, Q_t, I_t, q_{0,t})$ is a finite automaton with
transitions $I_t$ that we call instructions. Instructions $I_t$ are labelled by commands
from the set $\mathrm{Com}_t$, which we define in the next paragraph. We assume, w.l.o.g., that
states of different threads are disjoint. This implies that the sets of instructions
of different threads are distinct. We use $I := \biguplus_{t \in \mathrm{TID}} I_t$ for all instructions and
$\mathrm{Com} := \bigcup_{t \in \mathrm{TID}} \mathrm{Com}_t$ for all commands. For an instruction $\mathrm{inst} := (s, \mathrm{cmd}, s')$
in $I$, we define $\mathrm{cmd}(\mathrm{inst}) := \mathrm{cmd}$, $\mathrm{src}(\mathrm{inst}) := s$, and $\mathrm{dst}(\mathrm{inst}) := s'$.

To define the set of commands, let $\mathrm{DOM}$ be a finite domain of values that
we also use as addresses. We assume that value 0 is in $\mathrm{DOM}$. For each thread
$t$, let $\mathrm{REG}_t$ be a finite set of registers that take their values from $\mathrm{DOM}$. We
assume per-thread disjoint sets of registers. The set of expressions of thread
$t$, denoted by $\mathrm{EXP}_t$, is defined over registers from $\mathrm{REG}_t$, constants from $\mathrm{DOM}$,
and (unspecified) operators over $\mathrm{DOM}$. If $r \in \mathrm{REG}_t$ and $e, e' \in \mathrm{EXP}_t$, the
set of commands $\mathrm{Com}_t$ consists of loads from memory $r \leftarrow \mathrm{mem}[e]$, stores
to memory $\mathrm{mem}[e] \leftarrow e'$, memory fences $\mathrm{mfence}$, assignments $r \leftarrow e$, and
conditionals $\mathrm{assume}\ e$. We write $\mathrm{REG} := \biguplus_{t \in \mathrm{TID}} \mathrm{REG}_t$ for all registers and
$\mathrm{EXP} := \bigcup_{t \in \mathrm{TID}} \mathrm{EXP}_t$ for all expressions.


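$t_1$: $q_{0,1} \xrightarrow{\mathrm{mem}[x] \leftarrow 1} q_{1,1} \xrightarrow{r_1 \leftarrow \mathrm{mem}[y]} q_{2,1} \xrightarrow{\mathrm{assume}\ r_1 = 0} q_{g,1}$

$t_2$: $q_{0,2} \xrightarrow{\mathrm{mem}[y] \leftarrow 1} q_{1,2} \xrightarrow{r_2 \leftarrow \mathrm{mem}[x]} q_{2,2} \xrightarrow{\mathrm{assume}\ r_2 = 0} q_{g,2}$

Fig. 1. Simplified Dekker's algorithm.
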
The program in Figure 1 serves as our running example. It consists of two
threads $t_1$ and $t_2$ implementing a mutual exclusion protocol. Initially, the
addresses $x$ and $y$ contain 0. The first thread signals its intent to enter the critical
section by setting variable $x$ to 1. Next, the thread checks whether the second
thread wants to enter the critical section, too. It reads variable $y$ and, if it is
0, the first thread enters its critical section. The critical section actually is the
state $q_{g,1}$. The second thread behaves symmetrically.


2.1 Semantics of Parallel Programs 

The semantics of a parallel program $P$ under memory model $M = \mathrm{TSO}$ and
$M = \mathrm{SC}$ follows [25]. We define the semantics in terms of a state-space automaton
$X_M(P) := (E, S_M, \Delta_M, s_0)$. Each state $s = (\mathrm{pc}, \mathrm{val}, \mathrm{buf}) \in S_M$ is a tuple where
the program counter $\mathrm{pc}\colon \mathrm{TID} \to Q$ holds the current control state of each thread,
the valuation $\mathrm{val}\colon \mathrm{REG} \cup \mathrm{DOM} \to \mathrm{DOM}$ holds the values stored in registers and
at memory addresses, and the buffer configuration $\mathrm{buf}\colon \mathrm{TID} \to (\mathrm{DOM} \times \mathrm{DOM})^*$
holds a sequence of address-value pairs.

In the initial state $s_0 := (\mathrm{pc}_0, \mathrm{val}_0, \mathrm{buf}_0)$, the program counter holds the
initial control states, $\mathrm{pc}_0(t) := q_{0,t}$ for all $t \in \mathrm{TID}$, all registers and addresses
contain value 0, and all buffers are empty, $\mathrm{buf}_0(t) := \varepsilon$ for all $t \in \mathrm{TID}$.

The transition relation $\Delta_{\mathrm{TSO}}$ for TSO satisfies the rules given in Figure 2.
There are two more rules for register assignments and conditionals that are
standard and omitted. TSO architectures implement (FIFO) store buffering,
which means stores are buffered for later execution on the shared memory. Loads
from an address $a$ take their value from the most recent store to address $a$ that is
buffered. If there is no such buffered store, they access the main memory. This is
modelled by the Rules (LB) and (LM). Rule (ST) enqueues store operations as
address-value pairs to the buffer. Rule (MEM) non-deterministically dequeues
store operations and executes them on memory. Rule (F) states that a thread can
execute a fence only if its buffer is empty. As can be seen from Figure 2, events
labelling TSO transitions take the form $E \subseteq \mathrm{TID} \times (I \cup \{\mathrm{flush}\}) \times (\mathrm{DOM} \cup \{\bot\})$.

The SC [21] semantics is simpler than TSO in that stores are not buffered. 
Technically, we keep the set of states but change the transitions so that Rule (ST) 
is immediately followed by Rule (MEM). 

We are interested in the computations of program $P$ under $M \in \{\mathrm{TSO}, \mathrm{SC}\}$.
They are given by $C_M(P) := \mathcal{L}_F(X_M(P))$, where $F$ is the set of states with empty
buffers. With this choice of final states, we avoid incomplete computations that
have pending stores. Note that all SC states have empty buffers, which means the
SC computations form a subset of the TSO computations: $C_{\mathrm{SC}}(P) \subseteq C_{\mathrm{TSO}}(P)$.
We will use notation $\mathrm{Reach}_M(P)$ for the set of all states $s \in F$ that are reachable
by some computation in $C_M(P)$.

To give an example, the program from Figure 1 admits the TSO computation
$\tau_{wit}$ below where the store of the first thread is flushed at the end:

$\tau_{wit} = \mathrm{store}_1 \cdot \mathrm{load}_1 \cdot \mathrm{store}_2 \cdot \mathrm{flush}_2 \cdot \mathrm{load}_2 \cdot \mathrm{flush}_1$.
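(LB) $\dfrac{\mathrm{cmd} = r \leftarrow \mathrm{mem}[e_a] \qquad \mathrm{buf}(t){\downarrow}(\{a\} \times \mathrm{DOM}) = (a, v) \cdot \beta}{s \xrightarrow{(t, \mathrm{inst}, a)} (\mathrm{pc}', \mathrm{val}[r := v], \mathrm{buf})}$

(LM) $\dfrac{\mathrm{cmd} = r \leftarrow \mathrm{mem}[e_a] \qquad \mathrm{buf}(t){\downarrow}(\{a\} \times \mathrm{DOM}) = \varepsilon}{s \xrightarrow{(t, \mathrm{inst}, a)} (\mathrm{pc}', \mathrm{val}[r := \mathrm{val}(a)], \mathrm{buf})}$

(ST) $\dfrac{\mathrm{cmd} = \mathrm{mem}[e_a] \leftarrow e_v}{s \xrightarrow{(t, \mathrm{inst}, a)} (\mathrm{pc}', \mathrm{val}, \mathrm{buf}[t := (a, v) \cdot \mathrm{buf}(t)])}$

(MEM) $\dfrac{\mathrm{buf}(t) = \beta \cdot (a, v)}{s \xrightarrow{(t, \mathrm{flush}, a)} (\mathrm{pc}, \mathrm{val}[a := v], \mathrm{buf}[t := \beta])}$

(F) $\dfrac{\mathrm{cmd} = \mathrm{mfence} \qquad \mathrm{buf}(t) = \varepsilon}{s \xrightarrow{(t, \mathrm{inst}, \bot)} (\mathrm{pc}', \mathrm{val}, \mathrm{buf})}$

Fig. 2. Transition rules for $X_{\mathrm{TSO}}(P)$ assuming $s = (\mathrm{pc}, \mathrm{val}, \mathrm{buf})$ with $\mathrm{pc}(t) = q$ and
$\mathrm{inst} = (q, \mathrm{cmd}, q')$ in thread $t$. The program counter is always set to $\mathrm{pc}' = \mathrm{pc}[t := q']$.
We assume $a$ to be the address returned by an address expression $e_a$ and $v$
the value returned by a value expression $e_v$. We use $\mathrm{buf}(t){\downarrow}(\{a\} \times \mathrm{DOM})$ to project
the buffer content $\mathrm{buf}(t)$ to store operations that access address $a$.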
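The buffering rules can be animated with a small explicit-state explorer. The Python sketch below is an illustration under simplifying assumptions that are not part of the formal model: threads are straight-line command lists, expressions are constants, the only conditional is "assume register = 0", and the goal is that every thread has finished with an empty buffer. The names `DEKKER`, `successors`, and `goal_reachable` are hypothetical.

```python
from collections import deque

# Straight-line threads only (an assumption of this sketch): a thread is a
# list of commands and its control state is an index into that list.
DEKKER = (
    [("store", "x", 1), ("load", "r1", "y"), ("assume0", "r1")],
    [("store", "y", 1), ("load", "r2", "x"), ("assume0", "r2")],
)

def freeze(valuation):
    return tuple(sorted(valuation.items()))

def replace_at(items, index, value):
    return items[:index] + (value,) + items[index + 1:]

def successors(state, program, tso):
    """Yield the successors of state = (pcs, frozen valuation, buffers)."""
    pcs, vals, bufs = state
    val = dict(vals)  # registers and addresses share one valuation, default 0
    for t, thread in enumerate(program):
        if bufs[t]:  # (MEM): the oldest store sits at the end of the buffer
            addr, v = bufs[t][-1]
            yield pcs, freeze({**val, addr: v}), replace_at(bufs, t, bufs[t][:-1])
        if pcs[t] == len(thread):
            continue
        cmd = thread[pcs[t]]
        new_pcs = replace_at(pcs, t, pcs[t] + 1)
        if cmd[0] == "store":
            _, addr, v = cmd
            if tso:  # (ST): enqueue the pair as the newest buffer entry
                yield new_pcs, vals, replace_at(bufs, t, ((addr, v),) + bufs[t])
            else:  # SC: (ST) immediately followed by (MEM)
                yield new_pcs, freeze({**val, addr: v}), bufs
        elif cmd[0] == "load":
            _, reg, addr = cmd
            buffered = [v for a, v in bufs[t] if a == addr]
            # (LB) reads the newest buffered store, (LM) falls back to memory.
            v = buffered[0] if buffered else val.get(addr, 0)
            yield new_pcs, freeze({**val, reg: v}), bufs
        elif cmd[0] == "assume0":  # conditional "assume reg = 0"
            if val.get(cmd[1], 0) == 0:
                yield new_pcs, vals, bufs
        elif cmd[0] == "mfence":  # (F): enabled only with an empty buffer
            if not bufs[t]:
                yield new_pcs, vals, bufs

def goal_reachable(program, tso):
    """Can all threads finish with empty buffers?"""
    init = (tuple(0 for _ in program), (), tuple(() for _ in program))
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        pcs, _vals, bufs = state
        if all(pc == len(th) for pc, th in zip(pcs, program)) and not any(bufs):
            return True
        for nxt in successors(state, program, tso):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

On the simplified Dekker program, `goal_reachable(DEKKER, tso=True)` holds while `goal_reachable(DEKKER, tso=False)` does not, which matches the observation that both loads can read 0 only when a store is delayed.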

Consider an event $e = (t, \mathrm{inst}, a)$. By $\mathrm{thread}(e) := t$ we refer to the thread
that produced the event. Function $\mathrm{inst}(e) := \mathrm{inst}$ returns the instruction.
For flush events, $\mathrm{inst}(e)$ gives the instruction of the matching store event. By
$\mathrm{addr}(e) := a$ we denote the address that is accessed (if any). In the example,
$\mathrm{thread}(\mathrm{store}_1) = t_1$, $\mathrm{inst}(\mathrm{store}_1) = q_{0,1} \xrightarrow{\mathrm{mem}[x] \leftarrow 1} q_{1,1}$, and $\mathrm{addr}(\mathrm{store}_1) = x$.


3 Lazy TSO Reachability 

We introduce the reachability problem and present our main contribution: an 
algorithm that checks TSO reachability lazily. The iterative algorithm queries an 
oracle to identify sequences of instructions that, under the TSO semantics, lead 
to states not reachable under SC. In Section 3.1 we show that the algorithm
yields a sound and complete semi-decision procedure. 

Given a memory model $M \in \{\mathrm{SC}, \mathrm{TSO}\}$, the $M$ reachability problem expects
as input a program $P$ and a set of goal states $G \subseteq S_M$. We are mostly interested
in the control state of each thread. Therefore, goal states $(\mathrm{pc}, \mathrm{val}, \mathrm{buf})$ typically
specify a program counter $\mathrm{pc}$ but leave the memory valuation unconstrained.
Formally, the $M$ reachability problem asks if some state in $G$ is reachable in the
automaton $X_M(P)$.

Given: A parallel program $P$ and goal states $G$.

Problem: Decide $\mathcal{L}_{F \cap G}(X_M(P)) \neq \emptyset$.

We use notation $\mathrm{Reach}_M(P) \cap G$ for the set of reachable final goal states in $P$.

Instead of solving reachability under TSO directly, the algorithm we propose 
solves SC reachability and, if no goal state is reachable, tries to lazily introduce 
store buffering on a certain control path of the program. The algorithm delegates 
choosing the control path to an oracle function O. Given an input program R, 
the oracle returns a sequence of instructions from $I^*$ in that program. Formally, the
oracle satisfies the following requirements: 

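- If $O(R) = \varepsilon$ then $\mathrm{Reach}_{\mathrm{SC}}(R) = \mathrm{Reach}_{\mathrm{TSO}}(R)$.

- Otherwise, $O(R) = \mathrm{inst}_1 \mathrm{inst}_2 \cdots \mathrm{inst}_n$ with $\mathrm{cmd}(\mathrm{inst}_1)$ a store, $\mathrm{cmd}(\mathrm{inst}_n)$
a load, $\mathrm{cmd}(\mathrm{inst}_i) \neq \mathrm{mfence}$, and $\mathrm{dst}(\mathrm{inst}_i) = \mathrm{src}(\mathrm{inst}_{i+1})$ for $i \in [1..n-1]$.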
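The shape demanded of a non-empty oracle answer is purely syntactic and can be checked directly. The following Python sketch is an illustration only; it assumes instructions are triples (src, cmd, dst) and that a command is a tuple whose first component names its kind, neither of which is prescribed by the text. The first requirement is semantic and is not checked here.

```python
def is_store_to_load_path(sequence):
    """Check the shape required of a non-empty oracle answer.

    Each instruction is a triple (src, cmd, dst); cmd is a tuple whose
    first component names the command kind ("store", "load", "mfence", ...).
    """
    if not sequence:
        return False  # the empty answer is covered by the first requirement
    kinds = [cmd[0] for _src, cmd, _dst in sequence]
    starts_with_store = kinds[0] == "store"
    ends_with_load = kinds[-1] == "load"
    fence_free = "mfence" not in kinds
    # Consecutive instructions must form a control path.
    connected = all(
        sequence[i][2] == sequence[i + 1][0] for i in range(len(sequence) - 1)
    )
    return starts_with_store and ends_with_load and fence_free and connected
```

For the first thread of the running example, the path consisting of the store to x followed by the load from y passes this check.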

The lazy TSO reachability checker is outlined in Algorithm 1. As input, it
takes a program $P$ and an oracle $O$. We assume some control states in each
thread to be marked to define a set of goal states. The algorithm returns true
iff the program can reach a goal state under TSO. It works as follows. First, it
creates a copy $R$ of the program $P$. Next, it checks if a goal state is SC-reachable
in $R$ (Line 3). If that is the case, the algorithm returns true. Otherwise, it asks
the oracle $O$ where in the program to introduce store buffering. If $O(R) \neq \varepsilon$,
the algorithm extends $R$ to emulate store buffering on the path $O(R)$ under SC
(Line 8). Then it goes back to the beginning of the loop. If $O(R) = \varepsilon$, by the first
property of oracles, $R$ has the same reachable states under SC and under TSO.
This means the algorithm can safely return false (Line 10). Note that, since $R$
emulates TSO behavior of $P$, the algorithm solves TSO reachability for $P$.


Algorithm 1 Lazy TSO reachability checker
Input: Marked program $P$ and oracle $O$
Output: true if some goal state is TSO-reachable in $P$,
        false if no goal state is TSO-reachable in $P$

 1: $R := P$;
 2: while true do
 3:   if $\mathrm{Reach}_{\mathrm{SC}}(R) \cap G \neq \emptyset$ then   {check if some goal state is SC-reachable}
 4:     return true;
 5:   else
 6:     $\sigma := O(R)$;   {ask the oracle where to use store buffering}
 7:     if $\sigma \neq \varepsilon$ then
 8:       $R := R \oplus \sigma$;
 9:     else
10:       return false;
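The control structure of Algorithm 1 can be sketched in a few lines of Python. This is a skeleton under stated assumptions, not the Trencher implementation: the SC reachability check, the oracle, and the extension operator are passed in as the hypothetical callables `goal_reachable_sc`, `oracle`, and `extend`, and the empty sequence is represented by an empty list.

```python
def lazy_tso_reachable(program, goal_reachable_sc, oracle, extend):
    """Skeleton of the lazy loop; comments refer to lines of Algorithm 1."""
    current = program                      # Line 1: R := P
    while True:                            # Line 2
        if goal_reachable_sc(current):     # Line 3: SC reachability check
            return True                    # Line 4
        sigma = oracle(current)            # Line 6: ask the oracle
        if not sigma:                      # Line 7 fails: empty sequence
            return False                   # Line 10
        current = extend(current, sigma)   # Line 8: R := R (+) sigma
```

As in the algorithm, the loop may diverge when the oracle keeps producing sequences and no goal state ever becomes SC-reachable; termination is the caller's responsibility in this sketch.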


Let $\sigma := O(R) = \mathrm{inst}_1 \mathrm{inst}_2 \cdots \mathrm{inst}_n$ and let $t := (\mathrm{Com}_t, Q_t, I_t, q_{0,t})$ be
the thread of the instructions in $\sigma$. The modified program $R \oplus \sigma$ replaces $t$ by
a new thread $t \oplus \sigma$. The new thread emulates under SC the TSO semantics
of $\sigma$. Formally, the extension of $t$ by $\sigma$ is $t \oplus \sigma := (\mathrm{Com}'_t, Q'_t, I'_t, q_{0,t})$. The
thread $t \oplus \sigma$ is obtained from $t$ by adding sequences of instructions starting from
$\tilde{q}_0 := \mathrm{src}(\mathrm{inst}_1)$. To remember the addresses and values of the buffered stores,
we use auxiliary registers $ar_1, \ldots, ar_{max}$ and $vr_1, \ldots, vr_{max}$, where $max \leq n - 1$ is
the total number of store instructions in $\sigma$. The sets $\mathrm{Com}'_t \supseteq \mathrm{Com}_t$ and $Q'_t \supseteq Q_t$
are extended as necessary.
We define the extension by describing the new transitions that are added
to $I'_t$ for each $\mathrm{inst}_i$. In our construction, we use a variable $count$ to keep track
of the number of store instructions already processed. Initially, $Q'_t := Q_t$ and
$count := 0$. Based on the type of instructions, we distinguish the following cases.

If $\mathrm{cmd}(\mathrm{inst}_i) = \mathrm{mem}[e] \leftarrow e'$, we increment $count$ by 1 and add instructions
that remember the address and the value being written in $ar_{count}$ and $vr_{count}$.

If $\mathrm{cmd}(\mathrm{inst}_i) = r \leftarrow \mathrm{mem}[e]$, we add instructions to $I'_t$ that perform a load
from memory only when a load from the simulated buffer is not possible. More
precisely, if $j \in [1..count]$ is found so that $ar_j = e$, register $r$ is assigned the
value of $vr_j$. Otherwise, $r$ receives its value from the address indicated by $e$.
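The effect of the instrumented load can be illustrated by the following Python sketch. It assumes that, when several remembered stores target the same address, the most recent one (largest index) is chosen, mirroring rule (LB); the text above only requires that some matching index be found. The function name and the list and dictionary representation are hypothetical, and indices are 0-based here.

```python
def emulated_load(address, ar, vr, count, memory):
    """Return the value a load from `address` observes.

    `ar[j]` and `vr[j]` hold the address and value of the (j+1)-th
    remembered store; only the first `count` entries are valid.
    """
    # Scan from the newest remembered store down to the oldest one.
    for j in range(count - 1, -1, -1):
        if ar[j] == address:
            return vr[j]  # load from the simulated buffer
    return memory.get(address, 0)  # no buffered store: load from memory
```

In the instrumented thread this case distinction is unfolded into conditionals over the auxiliary registers rather than executed as a loop.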



If $\mathrm{cmd}(\mathrm{inst}_i)$ is an assignment or a conditional, we add $(\tilde{q}_{i-1}, \mathrm{cmd}(\mathrm{inst}_i), \tilde{q}_i)$
to $I'_t$. By the definition of an oracle, $\mathrm{cmd}(\mathrm{inst}_i)$ is never a fence.

The above cases handle all instructions in $\sigma$. So far, the extension added new
instructions to $I'_t$ that lead through the fresh states $\tilde{q}_1, \ldots, \tilde{q}_n$. Out of control
state $\tilde{q}_n$, we now recreate the sequence of stores remembered by the auxiliary
registers. Then we return to the control flow of the original thread $t$.

$\tilde{q}_n \xrightarrow{\mathrm{mem}[ar_1] \leftarrow vr_1} \cdots \xrightarrow{\mathrm{mem}[ar_{max}] \leftarrow vr_{max}} \mathrm{dst}(\mathrm{inst}_n)$

Next, we remove $\mathrm{inst}_1$ from the program. This prevents the oracle from
discovering in the future another instruction sequence that is essentially the
same as $\sigma$. As we will show, this is key to guaranteeing termination of the
algorithm for acyclic programs. However, the removal of $\mathrm{inst}_1$ may reduce the
set of TSO-reachable states. To overcome this problem, we insert additional
instructions. Consider an instruction $\mathrm{inst} \in I_t$ with $\mathrm{src}(\mathrm{inst}) = \mathrm{src}(\mathrm{inst}_i)$ for
some $i \in [1..n]$ and assume that $\mathrm{inst} \neq \mathrm{inst}_i$. We add instructions that recreate
the stores buffered in the auxiliary registers and return to $\mathrm{dst}(\mathrm{inst})$.

[Diagram: out of the fresh state for $\mathrm{src}(\mathrm{inst}_i)$, the chain $\mathrm{mem}[ar_1] \leftarrow vr_1$, ..., $\mathrm{mem}[ar_{count}] \leftarrow vr_{count}$, $\mathrm{cmd}(\mathrm{inst})$ leads to $\mathrm{dst}(\mathrm{inst})$.]

Similarly, for all load instructions $\mathrm{inst}_i$ as well as out of $\tilde{q}_1$ we add instructions
that flush and fence the pair $(ar_1, vr_1)$, make visible the remaining buffered
stores, and return to state $q$ in the original control flow. Below, $q := \mathrm{src}(\mathrm{inst}_i)$ if
$\mathrm{inst}_i$ is a load and $q := \mathrm{dst}(\mathrm{inst}_1)$, otherwise. Intuitively, this captures behaviors
that delay $\mathrm{inst}_1$ past loads earlier than $\mathrm{inst}_n$, and that do not delay $\mathrm{inst}_1$ past
the first load in $\sigma$.

[Diagram: the chain $\mathrm{mem}[ar_1] \leftarrow vr_1$, $\mathrm{mfence}$, ..., $\mathrm{mem}[ar_{count}] \leftarrow vr_{count}$ leads to $q$.]
[Figure: threads $t_1 \oplus \sigma$ and $t_2$; thread $t_2$ is unchanged from Figure 1.]

Fig. 3. Extension by $\mathrm{inst}(\mathrm{store}_1) \cdot \mathrm{inst}(\mathrm{load}_1)$ of the program in Figure 1. Goal state
$(\mathrm{pc}, \mathrm{val}, \mathrm{buf})$ with $\mathrm{val}(x) = \mathrm{val}(y) = 1$ and $\mathrm{val}(r_1) = \mathrm{val}(r_2) = 0$ is now SC-reachable.

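Figure 3 shows the extension of the program in Figure 1 by the instruction
sequence $\mathrm{inst}(\mathrm{store}_1) \cdot \mathrm{inst}(\mathrm{load}_1) := q_{0,1} \xrightarrow{\mathrm{mem}[x] \leftarrow 1} q_{1,1} \xrightarrow{r_1 \leftarrow \mathrm{mem}[y]} q_{2,1}$.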


3.1 Soundness and Completeness 

We show that Algorithm 1 is a decision procedure for acyclic programs. From
here until (inclusively) Theorem 3 we assume that all programs are acyclic, i.e.,
their instructions and control states form directed acyclic graphs. Theorem 4
then explains how Algorithm 1 yields a semi-decision procedure for all programs.

We first prove the extension sound and complete (Lemma 1): extending $R$
by sequence $\sigma := O(R)$ neither adds nor removes TSO-reachable states.
Afterwards, Lemma 2 shows that if Algorithm 1 extends $R$ by $\sigma$ (Line 8) then,
in subsequent iterations of the algorithm, no new sequence returned by the oracle
is the same as $\sigma$ (projected back to $P$). Next, by the first condition of an oracle
and using Lemma 2, we establish that Algorithm 1 is a decision procedure for
acyclic programs (Theorem 3). Finally, we show that Algorithm 1 can be turned
into a semi-decision procedure for all programs using a bounded model checking
approach (Theorem 4).

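Lemma 1 Let $\mathrm{DOM} \cup \mathrm{REG}$ be the addresses and registers of program $R$ and
let $\sigma := O(R)$. Then we have $(\mathrm{pc}, \mathrm{val}, \mathrm{buf}) \in \mathrm{Reach}_{\mathrm{TSO}}(R)$ if and only if
$(\mathrm{pc}, \mathrm{val}', \mathrm{buf}) \in \mathrm{Reach}_{\mathrm{TSO}}(R \oplus \sigma)$ with $\mathrm{val}(a) = \mathrm{val}'(a)$ for all $a \in \mathrm{DOM} \cup \mathrm{REG}$.

Let $t$ be the thread that differs in $R$ and $R \oplus \sigma$. To prove Lemma 1, one can show
that for any prefix $\alpha'$ of $\alpha \in C_{\mathrm{TSO}}(R)$ there is a prefix $\beta'$ of $\beta \in C_{\mathrm{TSO}}(R \oplus \sigma)$,
and vice versa, that maintain the following invariants.

Inv-0 $s_0 \xrightarrow{\alpha'} (\mathrm{pc}, \mathrm{val}, \mathrm{buf})$ and $s_0 \xrightarrow{\beta'} (\mathrm{pc}', \mathrm{val}', \mathrm{buf}')$.

Inv-1 If $\mathrm{pc}$ and $\mathrm{pc}'$ differ, they only differ for thread $t$. If $\mathrm{pc}(t) \neq \mathrm{pc}'(t)$, then
$\mathrm{pc}(t) = \mathrm{dst}(\mathrm{inst}_i)$ and $\mathrm{pc}'(t) = \tilde{q}_i$ for some $i \in [1..n-1]$.

Inv-2 $\mathrm{val}'(a) = \mathrm{val}(a)$ for all $a \in \mathrm{DOM} \cup \mathrm{REG}$.

Inv-3 $\mathrm{buf}$ and $\mathrm{buf}'$ differ at most for $t$. If $\mathrm{buf}(t) \neq \mathrm{buf}'(t)$, then $\mathrm{pc}'(t) = \tilde{q}_i$
for some $i \in [1..n-1]$ and $\mathrm{buf}(t) = (ar_{count}, vr_{count}) \cdots (ar_1, vr_1) \cdot \mathrm{buf}'(t)$ where
$count$ stores are seen along $\sigma$ from $\mathrm{src}(\mathrm{inst}_1)$ to $\mathrm{dst}(\mathrm{inst}_i)$.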
We now show that the oracle never suggests the same sequence $\sigma$ twice. Since
in $R \oplus \sigma$ we introduce new instructions that correspond to instructions in $R$,
we have to map back sequences of instructions $I_\oplus^*$ in $R \oplus \sigma$ to sequences of
instructions $I^*$ in $R$. Intuitively, the mapping gives the original instructions from
which the sequence was produced. Formally, we define a family of projection
functions $h_\sigma\colon I_\oplus^* \to I^*$ with $h_\sigma(\varepsilon) := \varepsilon$ and $h_\sigma(w \cdot \mathrm{inst}) := h_\sigma(w) \cdot h_\sigma(\mathrm{inst})$. For
an instruction $\mathrm{inst} \in I_\oplus$, we define $h_\sigma(\mathrm{inst}) := \mathrm{inst}$ provided $\mathrm{inst} \in I$. We set
$h_\sigma(\mathrm{inst}) := \mathrm{inst}_i$ if $\mathrm{inst}$ is a first instruction on the path between $\tilde{q}_{i-1}$ and $\tilde{q}_i$
for some $i \in [1..n]$. In all other cases, we delete the instruction, $h_\sigma(\mathrm{inst}) := \varepsilon$.
Then, if $R_0 := P$ is the original program, $\sigma_j$ is the sequence that the oracle
returns in iteration $j \in \mathbb{N}$ of the while loop, and $w$ is a sequence of instructions
in $R_{j+1}$, we define $h(w) := h_{\sigma_0}(\ldots h_{\sigma_j}(w))$. This latter function maps sequences
of instructions in program $R_{j+1}$ back to sequences of instructions in $P$.

We are ready to state our key lemma. Intuitively, if the oracle in Algorithm 1
returns $\sigma := O(R)$ and $\sigma' := O(R \oplus \sigma)$ then, necessarily, $h(\sigma') \neq h(\sigma)$.

Lemma 2 Let $R_0 := P$ and $R_{i+1} := R_i \oplus \sigma_i$ for $\sigma_i := O(R_i)$ as in Algorithm 1.
If $\sigma_{j+1} \neq \varepsilon$ then $h(\sigma_{j+1}) \neq h(\sigma_i)$ for all $i \leq j$.

Proof. Assume, to the contrary, that $h(\sigma_{j+1}) = h(\sigma_i)$ for some $i \leq j$ where
$\sigma_{j+1} := O(R_{j+1})$ and $\sigma_i := O(R_i)$. Let $\mathrm{inst}_{first}$ be the first (store) instruction
and $\mathrm{inst}_{last}$ the last (load) instruction of $\sigma_{j+1}$. Similarly, let $\mathrm{inst}'_{first}$ and $\mathrm{inst}'_{last}$
be the first and last instructions of $\sigma_i$. Since $h(\sigma_{j+1}) = h(\sigma_i)$ it means that
$h(\mathrm{inst}_{first}) = h(\mathrm{inst}'_{first})$ and $h(\mathrm{inst}_{last}) = h(\mathrm{inst}'_{last})$.

However, since all control flows of $R_{i+1} := R_i \oplus \sigma_i$ that recreate $h(\mathrm{inst}'_{first})$
before $h(\mathrm{inst}'_{last})$ also place a fence between the two, no other later sequences
that the oracle returns have $h(\mathrm{inst}'_{first})$ come before $h(\mathrm{inst}'_{last})$. This in particular
means that $\sigma_{j+1} = O(R_{j+1})$ where $h(\mathrm{inst}_{first})$ comes before $h(\mathrm{inst}_{last})$ does not
exist. In conclusion, the initial assumption is false. □

We can now prove Algorithm 1 sound and complete for acyclic programs
(Theorem 3). Lemma 2 and the assumption that the input program is acyclic
ensure that if no goal state is found SC-reachable (Line 3), then Algorithm 1
eventually runs out of sequences $\sigma$ to return (Line 6). If that is the case, $O(R)$
returns $\varepsilon$ in the last iteration of Algorithm 1. By the first oracle condition, we
know that the SC- and TSO-reachable states of $R$ are the same. Hence, no goal
state is TSO-reachable in $R$ and, by Lemma 1, no goal state is TSO-reachable
in the input program $P$ either. Otherwise, a goal state $s$ is SC-reachable by
some computation $\tau$ in $R_j$ for some $j \in \mathbb{N}$ and, by Lemma 1, there is a TSO
computation in $P$ corresponding to $\tau$ that reaches $s$.

Theorem 3 For acyclic programs, Algorithm 1 terminates. Moreover, it returns
true on input $P$ if and only if $\mathrm{Reach}_{\mathrm{TSO}}(P) \cap G \neq \emptyset$.

Proof. It is immediate that Algorithm 1 always terminates for acyclic programs.
On the one hand, the number of instruction sequences that start with a store
and end with a load as in the second oracle condition is finite in $P$. On the
other hand, by Lemma 2, at each iteration the oracle returns a sequence that
differs (in $P$) from the previous ones. These two facts imply termination.

We now prove that $\mathrm{Reach}_{\mathrm{TSO}}(P) \cap G \neq \emptyset$ iff Algorithm 1 returns true
on input $P$. For the easy direction, assume that Algorithm 1 returns true on
input $P$. This means that $\mathrm{Reach}_{\mathrm{SC}}(R) \cap G \neq \emptyset$ in the last iteration of the
algorithm's loop. Then, by $\mathrm{Reach}_{\mathrm{SC}}(R) \subseteq \mathrm{Reach}_{\mathrm{TSO}}(R)$ and Lemma 1, we know
that $\mathrm{Reach}_{\mathrm{SC}}(R) \subseteq \mathrm{Reach}_{\mathrm{TSO}}(P)$. Hence, $\mathrm{Reach}_{\mathrm{TSO}}(P) \cap G \neq \emptyset$.

For the reverse direction, assume that $\mathrm{Reach}_{\mathrm{TSO}}(P) \cap G \neq \emptyset$. Furthermore,
let $R_0 := P$ and $R_{i+1} := R_i \oplus \sigma_i$ for $\sigma_i := O(R_i)$. By the initial termination
argument we know there exists $j \in \mathbb{N}$ such that the algorithm terminates with
$R = R_j$ in its last loop iteration. That means that either the check in Line 3 of the
algorithm succeeds, in which case Algorithm 1 returns true, or the check in Line 7
of the algorithm fails, i.e. $O(R_j) = \varepsilon$ and $\mathrm{Reach}_{\mathrm{SC}}(R_j) \cap G = \emptyset$. In the latter
case, by the first oracle condition we know that $\mathrm{Reach}_{\mathrm{TSO}}(R_j) \cap G = \emptyset$ and, by
Lemma 1, we get $\mathrm{Reach}_{\mathrm{TSO}}(R_j) \supseteq \mathrm{Reach}_{\mathrm{TSO}}(R_0)$. Then, $\mathrm{Reach}_{\mathrm{TSO}}(P) \cap G = \emptyset$
contradicts the above assumption and concludes the proof. □


To establish that Algorithm 1 is a semi-decision procedure for all programs,
one can use an iterative bounded model checking approach. Bounded model
checking unrolls the input program $P$ up to a bound $k \in \mathbb{N}$ on the length
of computations. Then Algorithm 1 is applied to the resulting programs $P_k$.
If it finds a goal state TSO-reachable in $P_k$, this state corresponds to a TSO-reachable
goal state in $P$. Otherwise, we increase $k$ and try again. By Theorem 3,
we know that Algorithm 1 is a decision procedure for each $P_k$. This implies
that Algorithm 1 together with iterative bounded model checking yields a semi-decision
procedure that terminates for all positive instances of TSO reachability.
For negative instances of TSO reachability, however, the procedure is guaranteed
to terminate only if the input program $P$ is acyclic.
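The iterative scheme can be sketched as a simple driver loop in Python. The callables `unroll` and `decide_acyclic` are hypothetical placeholders for the unrolling of the program up to a bound and for Algorithm 1 applied to the unrolled program; the optional `max_bound` is an assumption added so that the sketch can give up, and the detection of acyclic inputs mentioned above is not modelled.

```python
from itertools import count

def semi_decide(program, unroll, decide_acyclic, max_bound=None):
    """Return True once some unrolling reaches a goal state.

    Returns None if `max_bound` is set and exhausted without an answer;
    without a bound the loop runs forever on negative instances.
    """
    for k in count(1):  # k = 1, 2, 3, ...
        if max_bound is not None and k > max_bound:
            return None  # gave up: no answer within the budget
        if decide_acyclic(unroll(program, k)):
            return True
```

A positive answer for some bound carries over to the original program, whereas a negative answer for a single bound only triggers the next iteration.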


Theorem 4 We have $G \cap \mathrm{Reach}_{\mathrm{TSO}}(P) \neq \emptyset$ if and only if, for large enough
$k \in \mathbb{N}$, Algorithm 1 returns true on input $P_k$.


Proof. Assume that $G \cap \mathrm{Reach}_{\mathrm{TSO}}(P) \neq \emptyset$. Then there exist some state $s \in G$
and $\alpha \in C_{\mathrm{TSO}}(P)$ such that $s_0 \xrightarrow{\alpha} s$. Let $k$ be the length of $\alpha$ and $G'$ be the
goal states of $X_{\mathrm{TSO}}(P_k)$. There exists a computation $\beta \in C_{\mathrm{TSO}}(P_k)$ that mimics
$\alpha$ and reaches $s' \in G'$. Hence, $G' \cap \mathrm{Reach}_{\mathrm{TSO}}(P_k) \neq \emptyset$ and, by Theorem 3,
Algorithm 1 returns true on input $P_k$.

For the reverse direction, assume that Algorithm 1 returns true on input $P_k$
for some $k \in \mathbb{N}$. Let $s'_0$ be the initial state of $X_{\mathrm{TSO}}(P_k)$ and, as before, $G'$ be the
goal states of $X_{\mathrm{TSO}}(P_k)$. By Theorem 3, there exists $s' \in G' \cap \mathrm{Reach}_{\mathrm{TSO}}(P_k)$
and $\beta \in C_{\mathrm{TSO}}(P_k)$ such that $s'_0 \xrightarrow{\beta} s'$. Since $P_k$ unrolls $P$ up to bound $k$, there
exists a computation $\alpha \in C_{\mathrm{TSO}}(P)$ that mimics $\beta$ and reaches $s \in G$. Therefore,
$G \cap \mathrm{Reach}_{\mathrm{TSO}}(P) \neq \emptyset$. □




4 A Robustness-based Oracle 

This section argues why robustness yields an oracle. Robustness [?] is
a correctness criterion requiring that for each TSO computation of a program 
there is an SC computation that has the same data and control dependencies. 
Delays due to store buffering are still allowed, as long as they do not produce 
dependencies between instructions that SC computations forbid. 

Dependencies between events are described in terms of the happens-before
relation of a computation $\tau \in C_{\mathrm{TSO}}(P)$. The happens-before relation is a union
of the three relations that we define below: $\to_{hb}(\tau) := {\to_{po}} \cup {\leftrightarrow} \cup {\to_{cf}}$.

The program order relation $\to_{po}$ is the order in which threads issue their
commands. Formally, it is the union of the program order relations for all threads:
$\to_{po} := \bigcup_{t \in \mathrm{TID}} \to^t_{po}$. Let $\tau'$ be the subsequence of all non-flush events of thread
$t$ in $\tau$. Then $\to^t_{po} := <_{\tau'}$.

The equivalence relation $\leftrightarrow$ links, in each thread, flush events and their
matching store events: $(t, \mathrm{inst}, a) \leftrightarrow (t, \mathrm{flush}, a)$.

The conflict relation $\to_{cf}$ orders accesses to the same address. Assume, on the
one hand, that $\tau = \tau_1 \cdot \mathrm{store} \cdot \tau_2 \cdot \mathrm{load} \cdot \tau_3 \cdot \mathrm{flush} \cdot \tau_4$ such that $\mathrm{store} \leftrightarrow \mathrm{flush}$,
events $\mathrm{store}$ and $\mathrm{load}$ access the same address $a$ and come from thread $t$,
and there is no other store event $\mathrm{store}' \in \tau_2$ such that $\mathrm{thread}(\mathrm{store}') = t$ and
$\mathrm{addr}(\mathrm{store}') = a$. Then the load event $\mathrm{load}$ is an early read of the value buffered
by the event $\mathrm{store}$ and $\mathrm{store} \to_{cf} \mathrm{load}$.

On the other hand, assume $\tau = \tau_1 \cdot e \cdot \tau_2 \cdot e' \cdot \tau_3$ such that $e$ and $e'$ are either
load or flush events that access the same address $a$, neither $e$ nor $e'$ is an early
read, and at least one of $e$ or $e'$ is a flush to $a$. If there is no other flush event
$\mathrm{flush} \in \tau_2$ with $\mathrm{addr}(\mathrm{flush}) = a$ then $e \to_{cf} e'$.

Figure 4 depicts the happens-before relation of computation $\tau_{wit}$.
A program $P$ is said to be robust against TSO if for each computation
$\tau \in C_{\mathrm{TSO}}(P)$ there exists a computation $\tau' \in C_{\mathrm{SC}}(P)$ such
that $\to_{hb}(\tau) = {\to_{hb}(\tau')}$. If a program $P$ is robust, then it reaches
the same set of final states under SC and under TSO:

Lemma 5 If $P$ is robust against TSO, then $\mathrm{Reach}_{\mathrm{SC}}(P) = \mathrm{Reach}_{\mathrm{TSO}}(P)$.

Proof. The $\subseteq$ inclusion holds by $C_{\mathrm{SC}}(P) \subseteq C_{\mathrm{TSO}}(P)$. For the reverse, assume
that there is a TSO computation $\tau \in C_{\mathrm{TSO}}(P)$ such that $s_0 \xrightarrow{\tau} s$. Since $P$ is
robust, there is an SC computation $\tau' \in C_{\mathrm{SC}}(P)$ such that $\to_{hb}(\tau) = {\to_{hb}(\tau')}$.
Then $\tau' \in C_{\mathrm{TSO}}(P)$ and, by Lemma [?], $s_0 \xrightarrow{\tau'} s$, so $s$ is SC-reachable. □

Our robustness-based oracle makes use of the following characterization of
robustness from earlier work [10]: a program $P$ is not robust against TSO iff
$C_{\mathrm{TSO}}(P)$ contains a computation, called witness, as in Figure 5.

Lemma 6 ([10]) Program $P$ is robust against TSO if and only if the set of
TSO computations $C_{\mathrm{TSO}}(P)$ contains no witness.


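[Figure: happens-before graph over the events $\mathrm{store}_1$, $\mathrm{load}_1$, $\mathrm{flush}_1$, $\mathrm{store}_2$, $\mathrm{load}_2$, $\mathrm{flush}_2$.]

Fig. 4. The relation $\to_{hb}(\tau_{wit})$.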
A witness $\tau$ delays stores of only one thread in $P$. The other threads adhere
to the SC semantics. Conditions (W1)-(W4) in Figure 5 describe formally this
restrictive behavior. Furthermore, condition (W5) implies that no computation
$\tau' \in C_{\mathrm{SC}}(P)$ can satisfy $\to_{hb}(\tau) = {\to_{hb}(\tau')}$.

The computation $\tau_{wit}$ is a witness for the program in Figure 1. Indeed, in no
SC computation of this program can both loads read the initial values of $x$ and $y$.
Relative to Figure 5, we have $\mathrm{store} = \mathrm{store}_1$, $\mathrm{load} = \mathrm{load}_1$, $\mathrm{flush} = \mathrm{flush}_1$,
$\tau_3 = \mathrm{store}_2 \cdot \mathrm{flush}_2 \cdot \mathrm{load}_2$, and $\tau_1 = \tau_2 = \tau_4 = \varepsilon$.



$\tau = \tau_1 \cdot store \cdot \tau_2 \cdot load \cdot \tau_3 \cdot flush \cdot \tau_4$

Fig. 5. Witness $\tau$, where flush is the flush event of store and $t := thread(store) = thread(load)$.
Witnesses satisfy the following constraints: (W1) Only thread t delays stores. (W2)
Event flush is the first delayed store of t and load is the last event of t past which
flush is delayed. So $\tau_2$ contains neither flush events nor fences of t. (W3) Sequence
$\tau_3$ contains no events of thread t. (W4) Sequence $\tau_4$ consists only of flush events e of
thread t. All these events e satisfy $addr(e) \neq addr(load)$. (W5) We require $load \to_{hb}^{+} e$
for all events e in $\tau_3 \cdot flush$.

The robustness-based oracle, given input P, finds a witness $\tau$ as in Figure 5
and returns the sequence of instructions for the events in $store \cdot \tau_2 \cdot load$ that
belong to thread t. If no witness exists, it returns $\varepsilon$. By Lemmas 5 and 6, this
satisfies the oracle conditions from Section 3. Note that, given a robust program
and the robustness-based oracle as inputs, Algorithm 1 returns within the first
iteration of the while loop.
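The interplay between the SC reachability check, the oracle, and the program extension can be sketched as a small driver loop. This is a minimal sketch under stated assumptions, not the implementation in Trencher: `sc_reachable`, `oracle`, and `extend` are hypothetical callbacks standing in for the SC model checker, the robustness-based oracle, and the extension of the program by an instruction sequence, and `max_iterations` is an added safety bound that the procedure itself does not have.

```python
def lazy_tso_reachable(program, sc_reachable, oracle, extend, max_iterations=None):
    """Iteratively extend `program` until the goal is SC-reachable or the oracle gives up.

    Returns True if a goal state was found, False if the oracle returned the
    empty sequence (no further TSO behaviour to add), and None if the
    optional iteration bound was exhausted first.
    """
    current = program
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        if sc_reachable(current):
            return True                 # goal found in the under-approximation
        sigma = oracle(current)         # instruction sequence of a witness, or empty
        if not sigma:
            return False                # e.g. robust program: first iteration suffices
        current = extend(current, sigma)
        iterations += 1
    return None

# Toy instantiation (purely illustrative): a "program" is the set of
# instruction sequences already emulated under SC.
toy_sc = lambda p: "st;ld" in p
toy_oracle = lambda p: ["st", "ld"]
toy_extend = lambda p, sigma: p | {";".join(sigma)}
```

In the toy instantiation the goal becomes SC-reachable after one extension; with an oracle that always returns the empty sequence the loop stops in its first iteration, mirroring the remark about robust programs.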


5 Experiments 

We have implemented our lazy TSO reachability algorithm on top of the tool
Trencher [1]. Trencher was initially developed for checking robustness and
implements the algorithm for finding witness computations described in [10]. Our
implementation reuses that algorithm as a robustness-based oracle. Trencher
originally used SPIN [12] as back-end SC reachability checker. The current
implementation, however, uses a simpler model checker that exploits information
about the instruction set for partial-order reduction. Moreover, it avoids having
to compile the verifier executables (pan) as is the case for SPIN.

We have implemented Algorithm 1 with the following amendments. First, the
extension does not delete the store instruction $inst_1$. This ensures the extended
program has a (sound) superset of the TSO behaviors of the original program.
Second, the extension only adds instructions along $\tilde{q}_1, \ldots, \tilde{q}_n$. The remaining
instructions were added to ensure all behaviors of the original program exist in the
extended program, once $inst_1$ is removed. The resulting algorithm is guaranteed
to give correct results for cyclic programs. Of course, it cannot be guaranteed
to terminate in general. Finally, our implementation explores extensions due to
different instruction sequences in parallel, rather than sequentially.









We compare our prototype implementation against two other model checkers
that support TSO semantics: Memorax [2] (revision 4f94ab6) and CBMC [11]
(version 4.7). Memorax implements a sound and complete reachability checking
procedure by reducing to coverability in a well-structured transition system.
CBMC is an SMT-based bounded model checker for C programs. Consequently,
it is sound, but not complete: it is complete only up to a given bound on the
number of loop iterations in the input program.

5.1 Examples 

We tested our tool on a set of examples. Figure 6 summarizes characteristics
of the examples taken from the initial Trencher tests: number of threads (T),
states (St), and transitions (Tr). The first example is a model of the buggy Parker
class from Java VM [15]. The next three examples are mutual exclusion protocols
implemented via shared variables. These protocols do not guarantee mutual
exclusion under TSO. We tested Dekker's and Peterson's algorithms for two
threads, and Lamport's fast mutex [35] for three threads. The last three tests
from Figure 6 give statistics concerning reachability in robust test cases for the
lock-free stack, and for the MCS and CLH locking algorithms from [10].

#  Program              T   St   Tr   RQ    CPU   Real
1  Parker (non-rob)     2   11   10    4      8      5
2  Peterson (non-rob)   2   14   18   12     21     13
3  Dekker (non-rob)     2   24   30   30    171     70
4  Lamport (non-rob)    3   33   36   27   1839    694
5  MCS Lock             4   52   50   30    127     61
6  CLH Lock             3   43   41   70     10      7
7  Lock-Free Stack      4   46   50   14      9      7

Fig. 6. Trencher benchmarking results. The tests are available online [1]. Times
here are in milliseconds.

We also performed three parametrized tests. First, we varied the number
of threads in Lamport's fast mutex [35] (see left-hand side of Figure 7). The
modified Dekker in Figure 8 is inspired by the examples of the fence-insertion
tool MUSKETEER [3] and adds an "N-branching diamond" (see right-hand side
of Figure 8) to both program threads. Lastly, the program in Figure 9 places
stores to address x on a length N loop in thread t1: since t1 expects to load the
initial y value while t2 expects to load 1 and then 0 from x, an execution that
reaches the goal state goes through the length N loop twice.

Fig. 7. The i-th Lamport mutex thread (left) and running times for N threads (right).

[Figure 8: control-flow graphs. Thread t1: mem[x] ← 1; N-branching diamond over a;
r1 ← mem[y]; assume r1 = 0. Thread t2: mem[y] ← 1; N-branching diamond over b;
r2 ← mem[x]; assume r2 = 0. The diamond over a starts at entry with r ← mem[a],
then branches either to assume r = 0 followed by mem[a] ← 1 or, for each
i ∈ [1..N − 1], to assume r = i followed by mem[a] ← (i + 1) mod N, and ends at exit.]

Fig. 8. Dekker's algorithm modified so that an "N-branching diamond" over distinct
addresses a, b ∉ {x, y} is placed between the accesses to x and y. A final goal state is
TSO-reachable if the first store is delayed past the last load in either t1 or t2.

Fig. 9. A final goal state is TSO-reachable if t1 goes through the (length N) loop two
times: once to satisfy the conditional of t2 that expects the value 1 and a second time
to satisfy the conditional that expects the value 0.

5.2 Evaluation

We ran all tests on a QEMU @ 2.67GHz virtual machine (16 cores) with 8 GB
RAM running GNU/Linux. The table in Figure 6 summarizes the results of the
Trencher benchmark tests. RQ is the number of SC reachability queries raised
by Trencher. The columns CPU and Real give the total CPU time and the
wall-clock time for performing a test.

The first graph in Figure 10 depicts the running times of the three tools
on the non-robust examples from Figure 6. For CBMC, we used the versions
of the mutual exclusion algorithms that its authors provide. For Memorax,
we hand-wrote *.rmm files for the first 4 test programs. We did not perform
a comparison for robust programs: if SC reachability returns false on an input
program, our implementation decides mutual exclusion as fast as Trencher is
able to determine robustness. Moreover, CBMC implements strictly an
under-approximative method where the number of loop iterations is bounded. Our
robust tests, however, contain unbounded loops.

[Figure 10: two running-time plots. Left x-axis: test order index from Figure 6.
Right x-axis: N, the diamond branching factor.]

Fig. 10. Running times for the non-robust tests in Figure 6 (left) and Figure 8 (right).

The high load needed to verify Lamport's mutex, in comparison with the
other Figure 6 tests, is explained by the correlation between the program's
data domain size and its number of threads. For a larger number of threads,
the right-hand-side graph in Figure 7 shows that CBMC is fastest. This is the
case since the smallest unwind bound already suffices for CBMC to conclude
reachability. For Memorax and Trencher the system runs out of memory
when N = 5. This underlines once again just how troublesome the state-space
explosion is for TSO reachability. Although it is not easily noticeable in the
picture, Memorax's exponential scaling is better than Trencher's: although
Trencher is slightly faster than Memorax for N ∈ {2, 3}, Memorax clearly
outperforms Trencher when N = 4.

The graph in Figure 9 shows that, for the second parameterized test, our
prototype is faster than CBMC. Indeed, with increasing N, an ever larger number
of constraints needs to be generated by CBMC. For Trencher, regardless of the
value of N, it takes three SC reachability queries to conclude TSO reachability.

The second graph in Figure 10 shows that, for the programs described by
Figure 8, our prototype is faster than Memorax. It seems Memorax cannot
cope well with the branching factor that the parameter N introduces.

To better understand the difficulty of the latter two parametric tests, we
present the exponential scaling behaviors of Trencher in Figure 11.


5.3 Discussion 

Because we find several witnesses in parallel, throughout the experiments our
implementation required up to 2 iterations of the loop in Algorithm 1. In the
case of robust programs, one iteration is always sufficient. This suggests that
robustness violations are really the critical behaviors leading to TSO reachability.

The experiments indicate that, at least for some programs with a high branching
factor, our implementation is faster than Memorax if a useful witness can be
found within a small number of iterations of Algorithm 1. Similarly, our prototype
is better than CBMC for programs which require a high unwinding bound
to make visible TSO behavior reaching a goal state. Although the two programs
by which we show this are rather artificial, we expect such characteristics to
occur in actual code. Hence, our approach seems to be strong on an orthogonal
set of programs. In a portfolio model checker, it could be used as a promising
alternative to the existing techniques.

Fig. 11. Additional Trencher results for the programs in Figures 8 and 9. Memorax
takes already 1 minute and 24 seconds for the program in Figure 8 and N = 50, while
CBMC takes 8 minutes and 35 seconds for the program in Figure 9 and N = 20.

To evaluate the practicality of our method, more experiments are needed. In
particular, we hope to be able to substantiate the above conjecture for concrete
programs with behavior like that depicted in Figures 8 and 9. Unfortunately,
there seems to be no clear way of translating (compiled) C programs into our
simplified assembly syntax without substantial abstraction. To handle C code, an
alternative would be to reimplement our method within CBMC. But this would
force us to determine a priori a good-enough unwinding bound. Moreover, we
could no longer conclude safety of robust programs with unbounded loops.
A A Simple Safe Program

The program from Figure 12 is safe since no goal state is TSO-reachable: the
initial control states will never be left since the conditionals will never succeed.
However, the algorithm that we describe for Theorem 4 does not terminate for
this example. Although every $P_k$ that unrolls the program in Figure 12 up to
$k \in \mathbb{N}$ is found safe, the algorithm only stops if a TSO-reachable state is found
or if $O(R) = \varepsilon$, which is never the case.

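[Figure 12: control-flow graphs of two threads. Recoverable edge labels:
mem[x] ← 1 − r1, mem[y] ← 1 − r2, r1 ← mem[y].]

Fig. 12. A safe program for which Algorithm 1 (as in Theorem 3) does not terminate.

The underlying reason why always $O(R) \neq \varepsilon$ is that there are infinitely
many sequences $inst_{store}^{m} \cdot inst_{load}$, where $inst_{store} = (q_{0,1},\ mem[x] \leftarrow 1 - r_1,\ q_{0,1})$,
$inst_{load} = (q_{0,1},\ r_1 \leftarrow mem[y],\ q_{0,1})$, and $m \in \mathbb{N}$.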

B Proofs missing in Subsection 3.1

Prior to proving Lemma 1 we do a bit of preparation. We rely on computations
that delay flush events locally the least. Lemma 7 explains what this means.

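Lemma 7 Let $\alpha \in C_{TSO}(R)$ and $t \in TID$. There exists $\hat{\alpha} \in C_{TSO}(R)$ such
that $\to_{hb}(\alpha) = \to_{hb}(\hat{\alpha})$ and, for all matching events $e_{store}$ and $e_{flush}$ within thread t, if
$\hat{\alpha} = \alpha_{prefix} \cdot e_{store} \cdot \alpha' \cdot e_{flush} \cdot \alpha_{suffix}$ then either

(1) $\alpha' := \beta \cdot e_{load} \cdot \beta'$ and all events $e \in \beta'$ are flushes,
or (2) all events $e \in \alpha'$ are local assignments or conditionals.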

Proof. Intuitively, the lemma states that flush events of thread t delayed past
same-thread local events may be delayed less without changing the happens-before
relation of the computation. Local events are assignments, conditionals,
and store events in the same thread.

Let $\alpha = \alpha_1 \cdot e_{store} \cdot \alpha_2 \cdot e \cdot \alpha_3 \cdot e_{flush} \cdot \alpha_4$ such that $e_{store}$ and $e_{flush}$ are matching events
of thread t, e is a local event in t and $thread(e') \neq t$ for all events $e' \in \alpha_3$.
We denote by $\alpha_0 := \alpha_1 \cdot e_{store} \cdot \alpha_2 \cdot \alpha_3 \cdot e_{flush} \cdot e \cdot \alpha_4$ the TSO computation that
first performs the flush $e_{flush}$ and then the event e. Notice that since $\alpha_3$ contains
no events e' with $thread(e') = t$, feasibility of computation $\alpha_0$ is ensured and
$\to_{hb}(\alpha) = \to_{hb}(\alpha_0)$ holds.

Starting with the last flush event in $\alpha$, we use the above reordering of events
e to locally delay flush events less. In the end we obtain computation $\hat{\alpha}$ in which
no flush event of thread t can be locally delayed less. □

Furthermore, in order to reference instructions of $R \oplus \sigma$ that the extension
adds we give an alternative description for some of the transition sequences in
the main text. Recall that variable count keeps track of the number of store
instructions processed along $\sigma$.

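If $cmd(inst_i) = mem[e] \leftarrow e'$, we said count is incremented and instructions
that remember the value and address written in $ar_{count}$ and $vr_{count}$ are added.

$$\tilde{q}_{i-1} \xrightarrow{\;ar_{count} \leftarrow e\;} \circ \xrightarrow{\;vr_{count} \leftarrow e'\;} \tilde{q}_i \qquad (1)$$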

If $cmd(inst_i) = r \leftarrow mem[e]$ we said instructions are added that load from
memory only when a load from the simulated buffer is not possible. More precisely,
if some $j \in [1, count]$ such that $ar_j = e$ is found, r is assigned the value
of $vr_j$. Otherwise, the register r receives its value from the address e.

Alternatively, assuming $q_{check,i,count} := \tilde{q}_{i-1}$, this can be stated as adding

$$\{(q_{check,i,count},\ \text{assume } ar_{count} = e,\ q_{buf,i,count})\} \qquad (2)$$
$$\uplus\ \{(q_{check,i,count},\ \text{assume } ar_{count} \neq e,\ q_{check,i,count-1})\} \qquad (3)$$
$$\uplus\ \{(q_{buf,i,count},\ r \leftarrow vr_{count},\ \tilde{q}_i)\} \qquad (4)$$
$$\ldots$$
$$\uplus\ \{(q_{check,i,1},\ \text{assume } ar_1 = e,\ q_{buf,i,1})\} \qquad (5)$$
$$\uplus\ \{(q_{check,i,1},\ \text{assume } ar_1 \neq e,\ q_{mem,i})\} \qquad (6)$$
$$\uplus\ \{(q_{buf,i,1},\ r \leftarrow vr_1,\ \tilde{q}_i)\} \qquad (7)$$
$$\uplus\ \{(q_{mem,i},\ r \leftarrow mem[e],\ \tilde{q}_i)\} \qquad (8)$$


We said that out of control state $\tilde{q}_n$ we create a sequence of stores to flush the
contents of the auxiliary registers and return to the code of the original thread.

$$\tilde{q}_n \xrightarrow{\;mem[ar_1] \leftarrow vr_1\;} \circ \ \cdots\ \circ \xrightarrow{\;mem[ar_{max}] \leftarrow vr_{max}\;} dst(inst_n)$$

Alternatively, we could have stated it as adding

$$\{(\tilde{q}_n,\ mem[ar_1] \leftarrow vr_1,\ q_{flush,1})\} \qquad (9)$$
$$\ldots$$
$$\uplus\ \{(q_{flush,max-1},\ mem[ar_{max}] \leftarrow vr_{max},\ dst(inst_n))\} \qquad (10)$$
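The chain of store instructions that empties the auxiliary registers can be generated mechanically. The sketch below is illustrative only: it assumes instructions are (source state, command, destination state) triples written as strings, and the helper name `flush_chain` and the string spellings of states and commands are hypothetical.

```python
def flush_chain(n, max_count, dst):
    """Build the store instructions that write ar_1..ar_max back to memory.

    Returns a list of (source, command, destination) triples; the chain
    starts in the fresh state for position n and ends in `dst`.
    Requires max_count >= 1.
    """
    states = [f"q~_{n}"] + [f"q_flush,{j}" for j in range(1, max_count)] + [dst]
    return [
        (states[j - 1], f"mem[ar_{j}] <- vr_{j}", states[j])
        for j in range(1, max_count + 1)
    ]
```

For example, `flush_chain(4, 3, "dst(inst_4)")` yields three instructions, the first leaving the fresh state and the last entering the destination of the final instruction of the sequence, in the oldest-first order a FIFO buffer would use.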
Furthermore, for all instructions $inst \in I_t$ with $src(inst) = src(inst_i)$ for
some $i \in [1..n]$ and for which $inst \neq inst_i$ we added instructions that flush the
stores buffered in the auxiliary registers and return to $dst(inst)$.

$$\tilde{q}_i \xrightarrow{\;mem[ar_1] \leftarrow vr_1\;} \circ \ \cdots\ \circ \xrightarrow{\;mem[ar_{count}] \leftarrow vr_{count}\;} \circ \xrightarrow{\;cmd(inst)\;} dst(inst)$$

Alternatively, we could have stated it as adding

$$\{(\tilde{q}_i,\ mem[ar_1] \leftarrow vr_1,\ q_{next,i,1})\} \qquad (11)$$
$$\ldots$$
$$\uplus\ \{(q_{next,i,count-1},\ mem[ar_{count}] \leftarrow vr_{count},\ q_{next,i,count})\} \qquad (12)$$
$$\uplus\ \{(q_{next,i,count},\ cmd(inst),\ dst(inst))\} \qquad (13)$$

Finally, for all load instructions $inst_i$, where $i < n$, as well as out of $\tilde{q}_1$ we
added instructions that flush and fence the pair $(ar_1, vr_1)$, make the remaining
buffered stores in the auxiliary registers visible, and return to q. Here $q :=
src(inst_i)$ in the load case and $q := dst(inst_1)$ otherwise.

$$\tilde{q}_i \xrightarrow{\;mem[ar_1] \leftarrow vr_1\;} \circ \xrightarrow{\;mfence\;} \circ \ \cdots\ \circ \xrightarrow{\;mem[ar_{count}] \leftarrow vr_{count}\;} q$$

Alternatively, we could have stated it as adding

$$\{(\tilde{q}_i,\ mem[ar_1] \leftarrow vr_1,\ q_{fence,i})\} \qquad (14)$$
$$\uplus\ \{(q_{fence,i},\ mfence,\ q_{orig,i,2})\} \qquad (15)$$
$$\uplus\ \{(q_{orig,i,2},\ mem[ar_2] \leftarrow vr_2,\ q_{orig,i,3})\} \qquad (16)$$
$$\ldots$$
$$\uplus\ \{(q_{orig,i,count},\ mem[ar_{count}] \leftarrow vr_{count},\ q)\} \qquad (17)$$

We can now turn to the actual proof of Lemma 1.

Proof (of Lemma 1). Assume t is the thread of $\sigma := inst_1 \cdot \ldots \cdot inst_n$, $X_{TSO}(R \oplus \sigma) :=
(E_\oplus, S_\oplus, \Delta_{TSO}, s_0^\oplus, F_\oplus)$, I and Q are the instructions and states of R, DOM and
REG are the addresses and registers used by R, and $I_\oplus$ are the instructions of thread t in
$R \oplus \sigma$ as described in Section 3.

A direct result of Lemmas 7 and 8 is that TSO computations of R that delay
flushes of t locally the least reach all the states in the set $Reach_{TSO}(R)$. Assume
$\alpha \in C_{TSO}(R)$ is a computation where flushes of t are delayed locally the least as
Lemma 7 describes and let $s_0, \ldots, s_m \in S_{TSO}$ for some $m \in \mathbb{N}$ be all the states
along the transition sequence $s_0 \xrightarrow{\alpha} s$, i.e., $s_m := s$. Also, for all
$k \in [0, m]$, let $\alpha_k$ denote prefixes of $\alpha$ with $s_0 \xrightarrow{\alpha_k} s_k$.

We prove by induction over state indexes $k \in [0, m]$ that there exist prefixes
$\beta_k$ of $\beta \in C_{TSO}(R \oplus \sigma)$ and states $s'_0, \ldots, s'_m \in S_\oplus$ along $s_0^\oplus \xrightarrow{\beta} s'$ with
$s'_0 := s_0^\oplus$ and $s'_m := s'$ such that the following invariants hold:
Inv-0 $s_0 \xrightarrow{\alpha_k} s_k = (pc, val, buf)$ and $s_0^\oplus \xrightarrow{\beta_k} s'_k = (pc', val', buf')$.

Inv-1 If pc and pc' differ then they only differ for thread t. Moreover, if
$pc(t) \neq pc'(t)$ then $pc(t) = dst(inst_i)$ and $pc'(t) = \tilde{q}_i$ for some $i \in [1..n-1]$.

Inv-2 $val'(a) = val(a)$ for all $a \in DOM \cup REG$.

Inv-3 buf and buf' differ at most for t. Furthermore, if $buf(t) \neq buf'(t)$ then
$pc'(t) = \tilde{q}_i$ for some $i \in [1..n-1]$ and $buf(t) = (\widehat{ar}_{count}, \widehat{vr}_{count}) \cdot \ldots \cdot (\widehat{ar}_1, \widehat{vr}_1) \cdot
buf'(t)$ where count stores are seen along $\sigma$ from $src(inst_1)$ to $dst(inst_i)$.

For the induction base case $k = 0$, $\alpha_0 = \varepsilon$, $s_k = s_0$, $pc = pc_0$, $val = val_0$,
and $buf = buf_0$. Then, for $\beta_0 := \varepsilon$ and $s'_0 = s_0^\oplus$, invariants Inv-0...3 hold.

For the induction step case, assume that invariants Inv-0...3 hold for
$k < m$ and that $s_k \xrightarrow{e} s_{k+1} := (pc_+, val_+, buf_+)$ for some $e \in E$. We use a
case distinction over possible events e to define $\beta_{k+1}$ such that $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1} :=
(pc'_+, val'_+, buf'_+)$ and invariants Inv-0...3 hold for k + 1.

If $thread(e) := t' \neq t$ it means $inst(e) \in I_\oplus$ is enabled in $pc'(t')$, so there
exist $e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$ such that $inst(e') := inst(e)$ and $(s'_k, e', s'_{k+1}) \in
\Delta_{TSO}$ in $X_{TSO}(R \oplus \sigma)$. We define $\beta_{k+1} := \beta_k \cdot e'$ and find that, by the $\Delta_{TSO}$
semantics (Figure 2) and under the assumption that invariants Inv-0...3 hold
for k, invariants Inv-0...3 also hold for k + 1.

If $thread(e) = t$ we make the following case distinction over e and $pc'(t)$.

Case 1: "e is a flush event." This first case deals with the possibility that a store
operation is flushed. Depending on whether $buf'(t) \neq \varepsilon$, we either flush the oldest
address-value pair of $buf'(t)$ or the first address-value auxiliary registers pair. By
Lemma 7 the latter case can only happen when $pc'(t) = \tilde{q}_i$ for some $i \in [2..n-1]$
and $inst_i$ performs a load or $i = 1$.

If $buf'(t) \neq \varepsilon$ we flush the oldest write access buffered. Namely, let $e_{flush} \in E_\oplus$
and $s'_{k+1} \in S_\oplus$ such that, according to rule (WM), $(s'_k, e_{flush}, s'_{k+1}) \in \Delta_{TSO}$.
We define $\beta_{k+1} := \beta_k \cdot e_{flush}$ and invariants Inv-0...3 hold for k + 1 since

(0) Inv-0,3 hold for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, implying Inv-0
holds for k + 1.

(1) Inv-1 holds for k, $pc_+(t) = pc(t)$, and $pc'_+(t) = pc'(t)$, so Inv-1 holds for
k + 1.

(2) Inv-2,3 hold for k, so events e and $e_{flush}$ update the same address by the
same value and Inv-2 holds for k + 1.

(3) Inv-3 holds for k and events e and $e_{flush}$ remove one address-value pair
from both $buf(t)$ and $buf'(t)$, so Inv-3 holds for k + 1.

Otherwise, $buf'(t) = \varepsilon$ and count stores are encountered from $src(inst_1)$ to
$pc'(t) = \tilde{q}_i$ for some $i \in [1..n-1]$. Then $buf(t) = (\widehat{ar}_{count}, \widehat{vr}_{count}) \cdot \ldots \cdot (\widehat{ar}_1, \widehat{vr}_1)$
and, by Lemma 7, we know $inst_i$ is either the first store $inst_1$ of $\sigma$ or a load.
Either way, let $e_1, \ldots, e_{count}, e_{flush}, e_{fence} \in E_\oplus$ match equations (14)-(17) in the
extension and $s'_{k+1} \in S_\oplus$ such that events $e_j$ are, for all $j \in [1..count]$, the
buffering events for the stores (14), (16), (17), $e_{flush}$ is the flush event for the store
(14), $e_{fence}$ is the event for the fence (15), and $s'_k \xrightarrow{e_1 \cdot e_{flush} \cdot e_{fence} \cdot e_2 \cdot \ldots \cdot e_{count}} s'_{k+1}$
according to rules (ST, MEM, F) in Figure 2. We then define $\beta_{k+1} :=
\beta_k \cdot e_1 \cdot e_{flush} \cdot e_{fence} \cdot e_2 \cdot \ldots \cdot e_{count}$ and find that invariants Inv-0...3 hold
for k + 1 since

(0) Inv-0,3 hold for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds
for k + 1.

(1) Inv-1 holds for k and $pc_+(t) = q = pc'_+(t)$, where $q := src(inst_i)$ if $inst_i$
is a load and $q := dst(inst_1)$ otherwise, so Inv-1 holds for k + 1.

(2) Inv-2,3 hold for k, events e and $e_{flush}$ update the same address by the
same value and, since the other events do not update any address, Inv-2 holds
for k + 1.

(3) Inv-3 holds for k, and events $e_2, \ldots, e_{count}$ place the corresponding
address-value pairs that match $buf_+(t)$ into $buf'_+(t)$, so Inv-3 holds for k + 1.

Case 2: "e is not a flush event, $pc'(t) = \tilde{q}_i$ for $i \in [1..n-1]$, $inst(e) \neq inst_{i+1}$."
Event e corresponds to an instruction that does not follow $\sigma$. Then, events for
instructions (11)-(13) place the auxiliary address-value pairs into $buf'_+(t)$ and
then perform $cmd(inst(e))$. Let $e_1, \ldots, e_{count}, e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$ such that
$e_j$ are, for all $j \in [1..count]$, the buffering events for stores (11)-(12), e' is the
event for instruction (13), and $s'_k \xrightarrow{e_1 \cdot \ldots \cdot e_{count} \cdot e'} s'_{k+1}$ according to the
rules of Figure 2. We define $\beta_{k+1} := \beta_k \cdot e_1 \cdot \ldots \cdot e_{count} \cdot e'$ and find that invariants
Inv-0...3 hold for k + 1 since

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k and $pc_+(t) = dst(inst(e)) = pc'_+(t)$, so Inv-1 holds for
k + 1.

(2) Inv-2 holds for k and the events e and e' update at most one REG register
by the same value, so Inv-2 holds for k + 1.

(3) Inv-3 holds for k, the buffering store events $e_1, \ldots, e_{count}$ make the
address-value pairs of the auxiliary registers explicit in $buf'_+(t)$, and if events
e and e' are buffering events for stores then they add the same address-value
pair, so Inv-3 holds for k + 1.

Case 3: "inst(e) performs a store and Case 2 fails." We analyze the following
subcases depending on the value of $pc'(t)$.

Case 3a: "$pc'(t) = \tilde{q}_{i-1}$ for some $i \in [1..n-1]$." Since Case 2 does not hold, $inst(e) =
inst_i$ and auxiliary registers track the store $inst_i$. Let $e_{ar}, e_{vr} \in E_\oplus$ be events for
the instructions in (1) and $s'_{k+1} \in S_\oplus$ such that $s'_k \xrightarrow{e_{ar} \cdot e_{vr}} s'_{k+1}$ according
to the $\Delta_{TSO}$ rule for local assignments. We define $\beta_{k+1} := \beta_k \cdot e_{ar} \cdot e_{vr}$ and find
that invariants Inv-0...3 hold for k + 1 since

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k, $pc_+(t) = dst(inst_i)$, and $pc'_+(t) = \tilde{q}_i$, so Inv-1 holds
for k + 1.

(2) Inv-2 holds for k and no memory changes occurred outside of auxiliary
registers, so Inv-2 holds for k + 1.

(3) Inv-3 holds for k and $(\widehat{ar}_{count}, \widehat{vr}_{count})$ matches the address-value pair
added by e to $buf_+(t)$, so Inv-3 holds for k + 1.

Case 3b: "$pc'(t) = pc(t) \neq src(inst_1)$." This case is similar to the one when
$thread(e) \neq t$ since $inst(e) \in I_\oplus$. Then there exist $e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$
such that $inst(e') = inst(e)$ and $(s'_k, e', s'_{k+1}) \in \Delta_{TSO}$ in $X_{TSO}(R \oplus \sigma)$. We define
$\beta_{k+1} := \beta_k \cdot e'$ and find that, by the $\Delta_{TSO}$ semantics (Figure 2), invariants
Inv-0...3 continue to hold for k + 1.

Case 4: "inst(e) performs a load and Case 2 fails." We analyze the following
subcases depending on the value of $pc'(t)$.

Case 4a: "$pc'(t) = \tilde{q}_{i-1}$ for some $i \in [1..n-1]$." Since Case 2 does not hold, $inst(e) =
inst_i$ and we use (2)-(8) to load from e only when no register $ar_j$ matches e for
any $j \in [1..count]$.

If there exists a largest $j \in [1..count]$ such that $ar_j = e$ then r will take
its value from the auxiliary register $vr_j$. Let $e_{count}, \ldots, e_j, e_{assign} \in E_\oplus$ and
$s'_{k+1} \in S_\oplus$ such that $e_l$ are, for all $l \in [j+1..count]$, the events for negative
conditional checks (3), (6), $e_j$ is the event for the earliest positive conditional check
(2), (5), $e_{assign}$ is the event for an instruction (4), (7), and $s'_k \xrightarrow{e_{count} \cdot \ldots \cdot e_j \cdot e_{assign}} s'_{k+1}$
according to the rules for conditionals and local assignments in $\Delta_{TSO}$. We
define $\beta_{k+1} := \beta_k \cdot e_{count} \cdot \ldots \cdot e_j \cdot e_{assign}$ and find that the invariants Inv-0...3
hold for k + 1 since

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k, $pc_+(t) = dst(inst_i)$, and $pc'_+(t) = \tilde{q}_i$, so Inv-1 holds for
k + 1.

(2) Inv-2 holds for k, both e and $e_{assign}$ update r by the same value, and no
other event $e_{count}, \ldots, e_j$ changes any address, so Inv-2 holds for k + 1.

(3) Inv-3 holds for k and no event alters buffer contents, so Inv-3 holds for
k + 1.

Otherwise, $ar_j \neq e$ holds for all $j \in [1..count]$ and the register r will take its
value from the address indicated by e. Namely, let $e_{count}, \ldots, e_1, e_{load} \in E_\oplus$
and $s'_{k+1} \in S_\oplus$ such that $e_l$ are, for all $l \in [1..count]$, the events for
negative conditional checks (3), (6), $e_{load}$ is the event for instruction (8), and
$s'_k \xrightarrow{e_{count} \cdot \ldots \cdot e_1 \cdot e_{load}} s'_{k+1}$ according to the rule for conditionals in $\Delta_{TSO}$
and (LB/LM). We define $\beta_{k+1} := \beta_k \cdot e_{count} \cdot \ldots \cdot e_1 \cdot e_{load}$ and find that invariants
Inv-0...3 hold for k + 1:

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k, $pc_+(t) = dst(inst_i)$, and $pc'_+(t) = \tilde{q}_i$, so Inv-1 holds for
k + 1.

(2) Inv-2 holds for k, both e and $e_{load}$ update r by the same value, and no
other event $e_{count}, \ldots, e_1$ changes any address, so Inv-2 holds for k + 1.

(3) Inv-3 holds for k and no event alters buffer contents, so Inv-3 holds for
k + 1.

Case 4b: "$pc'(t) = \tilde{q}_{n-1}$." Since Case 2 does not hold, $inst(e) = inst_n$. Furthermore,
because count = max, additionally to performing the events that simulate the load
behavior as in subcase 4a, the extension returns to the original program flow
using events for (9)-(10) and makes the auxiliary registers' address-value pairs
explicit in $buf'_+(t)$.

Let $e'_1, \ldots, e'_{max} \in E_\oplus$ and $s'_{k+1} \in S_\oplus$ such that $e'_l$ are, for all $l \in [1..max]$,
the buffering events for stores (9)-(10), and $s''_{k+1} \xrightarrow{e'_1 \cdot \ldots \cdot e'_{max}} s'_{k+1}$ according
to (LS) from Figure 13, with $s''_{k+1}$ being notation for $s'_{k+1}$ from 4a. We define
$\beta_{k+1} := \beta''_{k+1} \cdot e'_1 \cdot \ldots \cdot e'_{max}$, where $\beta''_{k+1}$ is notation for $\beta_{k+1}$ from 4a, and find
that the invariants Inv-0...3 hold for k + 1 since

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k and $pc_+(t) = dst(inst_n) = pc'_+(t)$, so Inv-1 holds for
k + 1.

(2) Inv-2 holds for k, both events e and $e_{load}$ update r by the same value, and
no other event $e_{count}, \ldots, e_1, e'_1, \ldots, e'_{max}$ changes any address, so Inv-2 holds for
k + 1.

(3) Inv-3 holds for k and events $e'_1, \ldots, e'_{max}$ place the corresponding
address-value pairs that match $buf_+(t)$ into $buf'_+(t)$, so Inv-3 holds for k + 1.

Case 4c: "$pc'(t) = pc(t)$." This case is similar to 3b. Let $e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$
such that $inst(e') = inst(e)$ and $(s'_k, e', s'_{k+1}) \in \Delta_{TSO}$ in $X_{TSO}(R \oplus \sigma)$. We
define $\beta_{k+1} := \beta_k \cdot e'$ and find that, by the $\Delta_{TSO}$ semantics (Figure 2), the
invariants Inv-0...3 hold for k + 1.

Case 5: "e performs an assignment, conditional, or memory fence and Case 2
fails." We analyze the following subcases.

Case 5a: "$pc'(t) = \tilde{q}_{i-1}$ for $i \in [1..n-1]$." Since Case 2 does not hold, $inst(e) = inst_i$ is
either a conditional or an assignment.

If $cmd(inst_i) = r \leftarrow e$ let $e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$ such that $inst(e') =
(\tilde{q}_{i-1},\ r \leftarrow e,\ \tilde{q}_i)$ and $(s'_k, e', s'_{k+1}) \in \Delta_{TSO}$ by the $\Delta_{TSO}$ rule for local
assignments. We define $\beta_{k+1} := \beta_k \cdot e'$ and find that the invariants Inv-0...3 hold for
k + 1 since

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k, $pc_+(t) = dst(inst_i)$, and $pc'_+(t) = \tilde{q}_i$, so Inv-1 holds for
k + 1.

(2) Inv-2 holds for k and the expression e is evaluated the same by both events e and e', so the
register r is updated by the same value and Inv-2 holds for k + 1.

(3) Inv-3 holds for k and no event alters buffer contents, so Inv-3 holds for
k + 1.

Otherwise, $cmd(inst_i) = \text{assume } e$. Let $e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$ such that
$inst(e') = (\tilde{q}_{i-1},\ \text{assume } e,\ \tilde{q}_i)$ and $(s'_k, e', s'_{k+1}) \in \Delta_{TSO}$ by the $\Delta_{TSO}$ rule for
conditionals. We define $\beta_{k+1} := \beta_k \cdot e'$ and find that the invariants Inv-0...3
hold for k + 1 since

(0) Inv-0 holds for k so $s_0 \xrightarrow{\alpha_{k+1}} s_{k+1}$ and $s_0^\oplus \xrightarrow{\beta_{k+1}} s'_{k+1}$, i.e. Inv-0 holds for
k + 1.

(1) Inv-1 holds for k, $pc_+(t) = dst(inst_i)$, and $pc'_+(t) = \tilde{q}_i$, so Inv-1 holds for
k + 1.

(2) Inv-2 holds for k and both e and e' do not change any address, so Inv-2
holds for k + 1.

(3) Inv-3 holds for k and no event alters buffer contents, so Inv-3 holds for
k + 1.

Case 5b: "$pc'(t) = pc(t)$." This case covers the remaining possibilities when e is an
assignment, conditional, or memory fence. Similar to cases 3b and 4c, let
$e' \in E_\oplus$ and $s'_{k+1} \in S_\oplus$ such that $inst(e') = inst(e)$ and $(s'_k, e', s'_{k+1}) \in \Delta_{TSO}$
in $X_{TSO}(R \oplus \sigma)$. We define $\beta_{k+1} := \beta_k \cdot e'$ and find that, by the $\Delta_{TSO}$ semantics
(Figure 2), invariants Inv-0...3 hold for k + 1.

The above case distinction covers all possibilities for events e that $\alpha$ may
perform from $s_k$. Hence, by complete induction, the extension does not remove
TSO-reachable states: if $s = (pc, val, buf)$ is reachable by $\alpha$ then there exists
$s' = (pc', val', buf')$ and $\beta \in C_{TSO}(R \oplus \sigma)$ such that s' is reachable by $\beta$ in $R \oplus \sigma$,
$pc = pc'$, $val(a) = val'(a)$ for all $a \in DOM \cup REG$, and $buf = buf'$ are empty.

For the reverse direction, let $f_R : C_{TSO}(R) \to C_{TSO}(R \oplus \sigma)$ be the map
$\alpha \mapsto \beta$ that the inductive proof implies, respectively $f'_R : E \to E_\oplus^{*}$ its
restriction to events matching the different inductive cases. Furthermore, consider
computations $\beta \in C_{TSO}(R \oplus \sigma)$ that do not interleave events of other threads
within the events of sequences $f'_R(e)$. Such computations reach the entire set
$Reach_{TSO}(R \oplus \sigma)$. E.g., since local events $e_{count}, \ldots, e_1$ as in case 4a that
precede $e_{load}$ can be performed right before $e_{load}$, the above restriction does not
change the set of TSO-reachable states in $R \oplus \sigma$. Note that $f_R$ is a bijection
between such computations $\beta$ and computations $\alpha \in C_{TSO}(R)$ that delay flushes
locally the least wrt. t. Another induction can show that for each computation $\beta$
as described above there exists a computation $\alpha \in C_{TSO}(R)$ such that invariants
Inv-0...3 hold for prefixes of $\beta$ and $\alpha$. This implies that the extension by $\sigma$ does
not add TSO-reachable states. □


C TSO Semantics and Proofs missing in Section 4

Figure 13 describes the full TSO semantics. For completeness, states $s \in S_{TSO}$ use
the additional event counter $ec : TID \to \mathbb{N}$ to identify events. This is used, e.g.,
to define matching stores and flushes and does not affect in any way our results.
As mentioned in Subsection 2.1, under SC, stores are flushed immediately:

$$\frac{cmd = mem[e_a] \leftarrow e_v,\quad a = \hat{e}_a,\quad v = \hat{e}_v,\quad id = ec(t)}{s \xrightarrow{(t,id,inst,a)\cdot(t,id,flush,a)} (ec', pc', val[a := v], buf)}\ \text{(LSWM)}$$

Lemma 8 If $\alpha, \beta \in C_{TSO}(P)$, $s_0 \xrightarrow{\alpha} s$, and $\to_{hb}(\alpha) = \to_{hb}(\beta)$ then $s_0 \xrightarrow{\beta} s$.

Proof. Assume $s_0 \xrightarrow{\beta} s'$. Since $\alpha$ and $\beta$ have the same program order $\to_{po}$,
it means s and s' have the same index counter ec and program counter pc.
Moreover, since $\alpha$ and $\beta$ have the same conflict order $\to_{cf}$, s and s' have the
same memory valuation val. Finally, since computations $\alpha$ and $\beta$ empty the
buffers, s and s' have empty buffers. In conclusion, s = s'. □
$$\frac{cmd = r \leftarrow mem[e_a],\quad a = \hat{e}_a,\quad buf(t){\downarrow}(\mathbb{N} \times \{a\} \times DOM) = (id, a, v) \cdot \beta}{s \xrightarrow{(t,ec(t),inst,a)} (ec', pc', val[r := v], buf)}\ \text{(RB)}$$

$$\frac{cmd = r \leftarrow mem[e_a],\quad a = \hat{e}_a,\quad buf(t){\downarrow}(\mathbb{N} \times \{a\} \times DOM) = \varepsilon,\quad v = val(a)}{s \xrightarrow{(t,ec(t),inst,a)} (ec', pc', val[r := v], buf)}\ \text{(RM)}$$

$$\frac{cmd = mem[e_a] \leftarrow e_v,\quad a = \hat{e}_a,\quad v = \hat{e}_v,\quad id = ec(t)}{s \xrightarrow{(t,id,inst,a)} (ec', pc', val, buf[t := (id, a, v) \cdot buf(t)])}\ \text{(LS)}$$

$$\frac{buf(t) = \beta \cdot (id, a, v)}{s \xrightarrow{(t,id,flush,a)} (ec, pc, val[a := v], buf[t := \beta])}\ \text{(WM)}$$

$$\frac{cmd = mfence,\quad buf(t) = \varepsilon}{s \xrightarrow{(t,ec(t),inst,\bot)} (ec', pc', val, buf)}\ \text{(LF)}$$

$$\frac{cmd = r \leftarrow e,\quad v = \hat{e}}{s \xrightarrow{(t,ec(t),inst,\bot)} (ec', pc', val[r := v], buf)}\ \text{(LA)}$$

$$\frac{cmd = \text{assume } e,\quad \hat{e} \neq 0}{s \xrightarrow{(t,ec(t),inst,\bot)} (ec', pc', val, buf)}\ \text{(LC)}$$

Fig. 13. Transition rules for $X_{TSO}(R)$ assuming $s = (ec, pc, val, buf)$ with $pc(t) = q$,
$inst = q \xrightarrow{cmd} q'$ in thread t, $ec' = ec[t := ec(t) + 1]$, $pc' = pc[t := q']$. We use $\hat{e}$ to
evaluate e under val and $buf(t){\downarrow}(\mathbb{N} \times \{a\} \times DOM)$ for stores in buf(t) that access a.
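To see the buffering rules at work, the following sketch explores all TSO computations of a tiny two-thread program by breadth-first search. It is an illustration under simplifying assumptions, not the semantics used by any of the tools: event counters are omitted, commands are restricted to constant stores, loads, and mfence, and the names `PROGRAM` and `tso_outcomes` are hypothetical. The example program is the assumed store-buffering shape (each thread stores to one variable and loads the other).

```python
from collections import deque

# Assumed example: each thread buffers a store and then loads the other variable.
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r1")),
    (("store", "y", 1), ("load", "x", "r2")),
)

def tso_outcomes(program, addrs=("x", "y"), regs=("r1", "r2")):
    """Return the register valuations of all final states (buffers empty)."""
    n = len(program)
    init = (tuple(0 for _ in program), tuple(0 for _ in addrs),
            tuple(0 for _ in regs), tuple(() for _ in program))
    seen, todo, finals = {init}, deque([init]), set()
    while todo:
        pc, mem, reg, buf = todo.popleft()
        if all(pc[t] == len(program[t]) for t in range(n)) and not any(buf):
            finals.add(reg)
        succs = []
        for t in range(n):
            if buf[t]:
                # Flush: the oldest store (last element) reaches memory.
                a, v = buf[t][-1]
                m = list(mem); m[addrs.index(a)] = v
                b = list(buf); b[t] = buf[t][:-1]
                succs.append((pc, tuple(m), reg, tuple(b)))
            if pc[t] == len(program[t]):
                continue
            kind, a, x = program[t][pc[t]]
            p = list(pc); p[t] += 1
            if kind == "store":
                # Buffered store: prepend, so the newest entry comes first.
                b = list(buf); b[t] = ((a, x),) + buf[t]
                succs.append((tuple(p), mem, reg, tuple(b)))
            elif kind == "load":
                # Early read from the own buffer if possible, else from memory.
                hits = [v for (a2, v) in buf[t] if a2 == a]
                v = hits[0] if hits else mem[addrs.index(a)]
                r = list(reg); r[regs.index(x)] = v
                succs.append((tuple(p), mem, tuple(r), buf))
            elif kind == "mfence" and not buf[t]:
                # A fence is enabled only when the thread's buffer is empty.
                succs.append((tuple(p), mem, reg, buf))
        for s in succs:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return finals
```

On the assumed example both loads can read 0 under TSO because each store may still sit in its buffer; placing an `("mfence", None, None)` command between the store and the load of each thread removes that outcome.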



